
    Performance of numerical algorithms for low-rank tensor operations in tensor-train / matrix-product-states format

    This talk discusses the node-level performance of numerical algorithms for handling high-dimensional problems in a compressed tensor format. It focuses on two problems in particular: (1) approximating large (dense) data (lossy compression) and (2) solving linear systems in the tensor-train / matrix-product-states format. For both problems, we optimize the required underlying linear algebra operations and the mapping of the high-level algorithm to (potentially less accurate) lower-level operations. In particular, we suggest improvements for costly orthogonalization and truncation steps based on a high-performance implementation of a "Q-less" tall-skinny QR decomposition. Further optimizations for solving linear systems include memory layout optimizations for faster tensor contractions and a simple generic preconditioner. We show performance results on today's multi-core CPUs, where we obtain a speedup of up to ~50x over the reference implementation for the lossy compression, and up to ~5x for solving linear systems.
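
    For illustration, the "Q-less" idea can be sketched in a few lines of NumPy: only the triangular factor R of a tall-skinny matrix is computed, via a blockwise reduction that never forms Q explicitly. This is a minimal sketch with an assumed block size, not the paper's high-performance implementation.

```python
import numpy as np

def qless_tsqr_r(A, block_rows=4096):
    """Compute only the R factor of a tall-skinny QR via a blockwise
    reduction; Q is never formed, which saves memory traffic.
    Illustrative sketch, not the optimized implementation."""
    R = np.empty((0, A.shape[1]))
    for start in range(0, A.shape[0], block_rows):
        # Stack the previous R on top of the next block and reduce again.
        stacked = np.vstack([R, A[start:start + block_rows]])
        R = np.linalg.qr(stacked, mode='r')
    return R

A = np.random.rand(100_000, 16)        # tall-skinny input
R = qless_tsqr_r(A)
print(np.allclose(R.T @ R, A.T @ A))   # R'R = A'A holds for any valid R
```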

    Performance and productivity of parallel python programming: a study with a CFD test case

    The programming language Python is widely used for the rapid development of compact software. However, its low performance compared to low-level programming languages like C or Fortran prevents its use for HPC applications. Efficient parallel programming of multi-core systems and graphics cards is generally a complex task; Python with suitable add-ons might provide a simple approach to programming those systems. This paper evaluates the performance of Python implementations with different libraries and compares it to implementations in C and Fortran. As a test case from the field of computational fluid dynamics (CFD), a part of a rotor simulation code was selected. Fortran versions of this code were available for single-core, multi-core, and graphics-card systems. For all these systems, multiple compact versions of the code were implemented in Python with different libraries. For the performance analysis of the rotor simulation kernel, a performance model was developed and then employed to assess the performance reached by the different implementations. Performance tests showed that an implementation in Python syntax is six times slower than Fortran on single-core systems. The performance on multi-core systems and graphics cards is about a tenth of that of the Fortran implementations. A higher performance was achieved by a hybrid implementation in C and Python using Cython, which reached about half the performance of the Fortran implementation.
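
    The performance gap quantified here can be illustrated with a toy kernel (not the rotor simulation code from the paper): the same three-point stencil written once as a pure-Python loop and once as NumPy whole-array operations.

```python
import numpy as np

def smooth_loop(u):
    """Pure-Python 1D three-point stencil: interpreter overhead per element."""
    v = u.copy()
    for i in range(1, len(u) - 1):
        v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
    return v

def smooth_numpy(u):
    """The same stencil as whole-array operations: loops run in compiled code."""
    v = u.copy()
    v[1:-1] = 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]
    return v

u = np.random.rand(10_000)
assert np.allclose(smooth_loop(u), smooth_numpy(u))
```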

    The Jacobi-Davidson Eigensolver on GPU Clusters

    Compared to multi-core processors, GPUs typically offer a higher memory bandwidth, which makes them attractive for memory-bound codes like sparse linear and eigenvalue solvers. The fundamental performance issue we encounter when implementing such methods for modern GPUs is that the ratio between memory bandwidth and memory capacity is significantly higher than for CPUs. When solving large-scale problems, one therefore has to use more compute nodes and is quickly forced into the strong scaling limit. In this paper we consider an advanced eigensolver (the block Jacobi-Davidson QR method [1]), implemented in the PHIST software (https://bitbucket.org/essex/phist/). We aim to provide a blueprint and a framework for implementing other iterative solvers, like Krylov subspace methods, for modern architectures that have relatively small high-bandwidth memory. The techniques we explore to reduce the memory footprint of our solver include mixed-precision arithmetic and recalculating quantities `on-the-fly'. We use performance models to back our results theoretically and to ensure performance portability.
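
    As a rough illustration of the mixed-precision technique (a sketch under simplifying assumptions, not PHIST's actual kernels): the large orthonormal basis is stored in single precision, while the small reductions are accumulated in double precision.

```python
import numpy as np

# Keep the large basis V in float32 to halve its memory footprint;
# promote to float64 only for the small reductions. A real GPU kernel
# would promote on the fly per block instead of copying V.
V, _ = np.linalg.qr(np.random.rand(500_000, 8))
V = V.astype(np.float32)              # stored basis, single precision
w = np.random.rand(500_000)           # new direction, double precision

coeffs = V.astype(np.float64).T @ w   # projection coefficients in double
w -= V.astype(np.float64) @ coeffs    # orthogonalize against the basis
w /= np.linalg.norm(w)
V = np.hstack([V, w[:, None].astype(np.float32)])  # extend stored basis
```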

    Performance of high-order SVD approximation: reading the data twice is enough

    This talk considers the problem of calculating a low-rank tensor approximation of some large dense data. We focus on the tensor-train SVD (TT-SVD), but the approach can be transferred to other low-rank tensor formats such as general tree tensor networks. In the TT-SVD algorithm, the dominant building block consists of singular value decompositions of tall-skinny matrices. Therefore, the computational performance is bound by data transfers on current hardware, as long as the desired tensor ranks are sufficiently small. Based on a simple roofline performance model, we show that under reasonable assumptions the minimal runtime is of the order of reading the data twice. We present an almost optimal, distributed parallel implementation that is based on a specialized rank-preserving TSQR step. Moreover, we discuss important algorithmic details and compare our results with common implementations, which are often about 50x slower than optimal.
    References:
    Oseledets: "Tensor-Train Decomposition", SISC 2011.
    Grasedyck and Hackbusch: "An Introduction to Hierarchical (H-) Rank and TT-Rank of Tensors with Examples", CMAM 2011.
    Demmel et al.: "Communication Avoiding Rank Revealing QR Factorization with Column Pivoting", SIMAX 2015.
    Williams et al.: "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM 2009.
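
    For orientation, the textbook TT-SVD (Oseledets 2011) reduces to a loop of reshapes and truncated SVDs of tall-skinny matrices; a minimal NumPy sketch with a fixed maximum rank follows. The optimized implementation discussed in the talk replaces the SVD of the unfolding by the rank-preserving TSQR step.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Plain TT-SVD: successive reshapes and truncated SVDs (sketch)."""
    cores, r = [], 1
    work = np.asarray(tensor)
    for d in tensor.shape[:-1]:
        # Unfold to a tall-skinny matrix and truncate its SVD.
        U, s, Vt = np.linalg.svd(work.reshape(r * d, -1), full_matrices=False)
        r_new = min(max_rank, len(s))
        cores.append(U[:, :r_new].reshape(r, d, r_new))
        work = s[:r_new, None] * Vt[:r_new]
        r = r_new
    cores.append(work.reshape(r, tensor.shape[-1], 1))
    return cores

cores = tt_svd(np.random.rand(8, 8, 8, 8), max_rank=4)
print([c.shape for c in cores])   # [(1, 8, 4), (4, 8, 4), (4, 8, 4), (4, 8, 1)]
```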

    Performance of Low-Rank Tensor Algorithms

    We discuss low-rank tensor algorithms, in particular algorithms for the tensor-train (TT) format (known as matrix-product states, MPS, in computational physics). We focus on the required building blocks and model their node-level performance on modern multi-core CPUs. More specifically, we consider the lossy compression of large dense data (TT-SVD), as well as linear solvers in TT format (TT-MALS, TT-GMRES). For the data compression, we derive the optimal roofline runtime for the complete algorithm based on the two main building blocks of an optimized implementation: Q-less TSQR and tall-skinny matrix-matrix multiplication. For the low-rank linear solvers, we categorize the different kinds of building blocks according to their performance characteristics and show possible performance optimizations. While all required tensor operations can theoretically be mapped onto standard BLAS/LAPACK routines, faster implementations need specific performance optimizations: these include (1) avoiding costly singular value decompositions (SVDs), (2) employing special fused operations for sequences of memory-bound tensor contractions and reshaping operations, and (3) tracking properties of tensors such as orthogonality. We show the effect of the different optimizations and compare the runtime of our implementation with other tensor libraries.
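
    The observation that all required tensor operations map onto standard BLAS routines can be made concrete: a typical TT core contraction reduces to a single GEMM after a reshape, as in this minimal sketch (shapes are illustrative). The fused operations mentioned above avoid materializing the intermediate reshapes and transposes of such sequences.

```python
import numpy as np

# Contracting a boundary matrix M (r_old x r) into a TT core C (r x n x r2)
# is one GEMM on a reshaped view -- no dedicated tensor kernel is needed.
r, n, r2, r_old = 16, 50, 16, 8
C = np.random.rand(r, n, r2)
M = np.random.rand(r_old, r)

out = (M @ C.reshape(r, n * r2)).reshape(r_old, n, r2)  # single GEMM
ref = np.einsum('ar,rnb->anb', M, C)                    # same contraction
assert np.allclose(out, ref)
```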

    GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

    While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.
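
    As a point of reference for the kind of building block GHOST provides, here is the reference semantics of a sparse matrix-vector product on a CSR matrix in NumPy. GHOST itself implements such kernels in C with hardware-specific storage formats, hybrid MPI+X parallelism, and heterogeneous execution; this sketch shows only the mathematics.

```python
import numpy as np

def spmv_csr(vals, col_idx, row_ptr, x):
    """Reference CSR sparse matrix-vector product y = A @ x."""
    y = np.zeros(len(row_ptr) - 1)
    for row in range(len(y)):
        lo, hi = row_ptr[row], row_ptr[row + 1]
        y[row] = vals[lo:hi] @ x[col_idx[lo:hi]]
    return y

# 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
vals = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(vals, col_idx, row_ptr, np.ones(3)))   # [5. 2. 8.]
```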

    Rotor Blade Modeling in a Helicopter Multi Body Simulation Based on the Floating Frame of Reference Formulation

    The Floating Frame of Reference formulation was chosen to include the Beam Advanced Model in DLR’s Versatile Aeromechanics Simulation Tool. During the development and concurrent testing of the model in the field of helicopter rotor dynamics, some particular shortcomings have become apparent. These mainly, but not exclusively, concern inertial loads affecting the flexible motion of beams. This paper treats the related physical phenomena and proposes enhancements to the model which remedy the deficiencies of the baseline method. Particular attention is given to the introduction of rotational shape functions, which account, e.g., for the propeller moment, and to the consideration of an accelerated Floating Frame of Reference, which addresses the blade attachment’s radial offset from the rotor center in the centrifugal field. Furthermore, the application of external loads (e.g. airloads) away from the beam’s nodes or off the beam axis is addressed as a prerequisite for independent structural and aerodynamic discretization. Finally, the modal reduction under centrifugal loading is considered. The individual model upgrades are verified against analytical reference results for appropriate rotor dynamics test cases. The enhancements are necessary for simulating flexible helicopter rotor blades within a Multi Body System, a feature required for sophisticated simulation scenarios in which the limitations of conventional rotor models (e.g. constant rotational hub speed) are exceeded.

    PHIST: a Pipelined, Hybrid-parallel Iterative Solver Toolkit

    The increasing complexity of hardware and software environments in high-performance computing poses big challenges for the development of sustainable and hardware-efficient numerical software. This paper addresses these challenges in the context of sparse solvers. Existing solutions typically target sustainability, flexibility, or performance, but rarely all of them. Our new library PHIST provides implementations of solvers for sparse linear systems and eigenvalue problems. It is a productivity platform for performance-aware developers of algorithms and application software, with abstractions that do not obscure the view on hardware-software interaction. The PHIST software architecture and the PHIST development process were designed to overcome shortcomings of existing packages. An interface layer for basic sparse linear algebra functionality that can be provided by multiple backends ensures sustainability, and PHIST supports common techniques for improving the scalability and performance of algorithms, such as blocking and kernel fusion. We showcase these concepts using the PHIST implementation of a block Jacobi-Davidson solver for non-Hermitian and generalized eigenproblems. We study its performance on a multi-core CPU, a GPU, and a large-scale many-core system. Furthermore, we show how an existing implementation of a block Krylov-Schur method in the Trilinos package Anasazi can benefit from the performance engineering techniques used in PHIST.
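
    Blocking and kernel fusion can be illustrated in miniature (a schematic NumPy sketch, not PHIST's interface): operating on a block of vectors at once amortizes memory traffic, and fusing two reductions into a single pass over the data halves it.

```python
import numpy as np

n, nb = 1_000_000, 4
V = np.random.rand(n, nb)   # block of nb vectors, processed together
w = np.random.rand(n)

# Unfused: two sweeps over V, i.e. V is streamed from memory twice.
dots = V.T @ w
norms_sq = np.einsum('ij,ij->j', V, V)

# Fused (schematically): one Gram-matrix sweep yields both reductions,
# so V is streamed from memory only once.
stacked = np.hstack([V, w[:, None]])
gram = stacked.T @ stacked
assert np.allclose(gram[:nb, nb], dots)           # contains V.T @ w
assert np.allclose(np.diag(gram)[:nb], norms_sq)  # and the squared norms
```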